PaliGemma/[PaliGemma_1]Image_captioning.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "N0WWINoe_H78" }, "source": [ "##### Copyright 2024 Google LLC." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "KCbtuTFR_Qot" }, "outputs": [], "source": [ "# @title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "UHNTG5-wNyGh" }, "source": [ "# Image captioning using PaliGemma\n", "In this notebook, we'll explore image captioning using PaliGemma, a state-of-the-art vision-language model developed by Google. PaliGemma is designed to understand both images and text, making it ideal for generating accurate and descriptive captions for a wide range of images.\n", "\n", "Image captioning plays a crucial role in making the web accessible to everyone, particularly individuals who are blind or visually impaired. While alternative text (alt text) provides a concise description of an image, captions offer a more comprehensive explanation, conveying the context, details, and nuances that might be missed in a brief alt text. This ensures that all users, regardless of their visual abilities, can fully understand and appreciate the content of images on websites, contributing to a more inclusive and equitable online experience.\n", "\n", "\n", "<table align=\"left\">\n", " <td>\n", " <a target=\"_blank\" href=\"https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/PaliGemma/[PaliGemma_1]Image_captioning.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n", " </td>\n", "</table>\n" ] }, { "cell_type": "markdown", "metadata": { "id": "00NyqqWcAXw1" }, "source": [ "## Setup\n", "\n", "### Select the Colab runtime\n", "To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you should use a L4 GPU or an A100 GPU, as a T4 will be insufficient:\n", "\n", "1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.\n", "2. Select **Change runtime type**.\n", "3. Under **Hardware accelerator**, select **L4 GPU** or **A100 GPU**.\n", "\n", "\n", "### Gemma setup on Kaggle\n", "To complete this tutorial, you'll first need to complete the setup instructions at [Gemma setup](https://ai.google.dev/gemma/docs/setup), as PaliGemma is a Gemma variant.\n", "\n", "In brief, you will need to\n", "\n", "* Get access to Gemma on kaggle.com.\n", "* Generate and configure a Kaggle username and API key.\n", "\n", "After you've completed the Gemma setup, move on to the next section, where you'll set your username and API key as environment variables for your Colab environment." 
] }, { "cell_type": "markdown", "metadata": { "id": "YQYPV5urTkt1" }, "source": [ "## Accessing Kaggle Credentials\n", "\n", "We will need to provide our Kaggle username and API key in order to download the PaliGemma model from Kaggle.\n", "\n", "The code below fetches these credentials from the Google Colab user data, avoiding the need to expose them directly in the notebook.\n", "\n", "If you haven't already, set your Kaggle username and API key appropriately in your Colab user data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "V50amY4Ik6Va" }, "outputs": [], "source": [ "import os\n", "from google.colab import userdata\n", "\n", "os.environ[\"KAGGLE_USERNAME\"] = userdata.get(\"KAGGLE_USERNAME\")\n", "os.environ[\"KAGGLE_KEY\"] = userdata.get(\"KAGGLE_KEY\")" ] }, { "cell_type": "markdown", "metadata": { "id": "jGELpI_2TYVr" }, "source": [ "## Installing Required Libraries\n", "\n", "Before we dive into using PaliGemma, let's make sure we have all the necessary libraries installed. The following commands will upgrade `keras-cv`, `keras-nlp`, and `keras` to their latest versions, ensuring we have access to the most up-to-date features and improvements for working with vision and language models." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3p9xWE-0TawJ" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting keras-cv\n", " Downloading keras_cv-0.9.0-py3-none-any.whl (650 kB)\n", "\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/650.7 kB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m368.6/650.7 kB\u001b[0m \u001b[31m10.9 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m650.7/650.7 kB\u001b[0m \u001b[31m14.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hRequirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from keras-cv) (24.0)\n", "Requirement already satisfied: absl-py in /usr/local/lib/python3.10/dist-packages (from keras-cv) (1.4.0)\n", "Requirement already satisfied: regex in /usr/local/lib/python3.10/dist-packages (from keras-cv) (2023.12.25)\n", "Requirement already satisfied: tensorflow-datasets in /usr/local/lib/python3.10/dist-packages (from keras-cv) (4.9.4)\n", "Collecting keras-core (from keras-cv)\n", " Downloading keras_core-0.1.7-py3-none-any.whl (950 kB)\n", "\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/950.8 kB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m950.8/950.8 kB\u001b[0m \u001b[31m59.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hRequirement already satisfied: kagglehub in /usr/local/lib/python3.10/dist-packages (from keras-cv) (0.2.5)\n", "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from kagglehub->keras-cv) (2.31.0)\n", "Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from kagglehub->keras-cv) (4.66.4)\n", "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from keras-core->keras-cv) (1.25.2)\n", "Requirement already satisfied: rich in /usr/local/lib/python3.10/dist-packages (from keras-core->keras-cv) (13.7.1)\n", "Collecting namex 
(from keras-core->keras-cv)\n", " Downloading namex-0.0.8-py3-none-any.whl (5.8 kB)\n", "Requirement already satisfied: h5py in /usr/local/lib/python3.10/dist-packages (from keras-core->keras-cv) (3.9.0)\n", "Requirement already satisfied: dm-tree in /usr/local/lib/python3.10/dist-packages (from keras-core->keras-cv) (0.1.8)\n", "Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from tensorflow-datasets->keras-cv) (8.1.7)\n", "Requirement already satisfied: etils[enp,epath,etree]>=0.9.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow-datasets->keras-cv) (1.7.0)\n", "Requirement already satisfied: promise in /usr/local/lib/python3.10/dist-packages (from tensorflow-datasets->keras-cv) (2.3)\n", "Requirement already satisfied: protobuf>=3.20 in /usr/local/lib/python3.10/dist-packages (from tensorflow-datasets->keras-cv) (3.20.3)\n", "Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from tensorflow-datasets->keras-cv) (5.9.5)\n", "Requirement already satisfied: tensorflow-metadata in /usr/local/lib/python3.10/dist-packages (from tensorflow-datasets->keras-cv) (1.15.0)\n", "Requirement already satisfied: termcolor in /usr/local/lib/python3.10/dist-packages (from tensorflow-datasets->keras-cv) (2.4.0)\n", "Requirement already satisfied: toml in /usr/local/lib/python3.10/dist-packages (from tensorflow-datasets->keras-cv) (0.10.2)\n", "Requirement already satisfied: wrapt in /usr/local/lib/python3.10/dist-packages (from tensorflow-datasets->keras-cv) (1.14.1)\n", "Requirement already satisfied: array-record>=0.5.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow-datasets->keras-cv) (0.5.1)\n", "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from etils[enp,epath,etree]>=0.9.0->tensorflow-datasets->keras-cv) (2023.6.0)\n", "Requirement already satisfied: importlib_resources in /usr/local/lib/python3.10/dist-packages (from etils[enp,epath,etree]>=0.9.0->tensorflow-datasets->keras-cv) (6.4.0)\n", "Requirement already satisfied: typing_extensions in /usr/local/lib/python3.10/dist-packages (from etils[enp,epath,etree]>=0.9.0->tensorflow-datasets->keras-cv) (4.11.0)\n", "Requirement already satisfied: zipp in /usr/local/lib/python3.10/dist-packages (from etils[enp,epath,etree]>=0.9.0->tensorflow-datasets->keras-cv) (3.18.2)\n", "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->kagglehub->keras-cv) (3.3.2)\n", "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->kagglehub->keras-cv) (3.7)\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->kagglehub->keras-cv) (2.0.7)\n", "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->kagglehub->keras-cv) (2024.2.2)\n", "Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from promise->tensorflow-datasets->keras-cv) (1.16.0)\n", "Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich->keras-core->keras-cv) (3.0.0)\n", "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich->keras-core->keras-cv) (2.16.1)\n", "Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich->keras-core->keras-cv) (0.1.2)\n", "Installing collected 
packages: namex, keras-core, keras-cv\n", "Successfully installed keras-core-0.1.7 keras-cv-0.9.0 namex-0.0.8\n", "Collecting keras-nlp\n", " Downloading keras_nlp-0.12.1-py3-none-any.whl (570 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m570.5/570.5 kB\u001b[0m \u001b[31m12.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hRequirement already satisfied: keras-core in /usr/local/lib/python3.10/dist-packages (from keras-nlp) (0.1.7)\n", "Requirement already satisfied: absl-py in /usr/local/lib/python3.10/dist-packages (from keras-nlp) (1.4.0)\n", "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from keras-nlp) (1.25.2)\n", "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from keras-nlp) (24.0)\n", "Requirement already satisfied: regex in /usr/local/lib/python3.10/dist-packages (from keras-nlp) (2023.12.25)\n", "Requirement already satisfied: rich in /usr/local/lib/python3.10/dist-packages (from keras-nlp) (13.7.1)\n", "Requirement already satisfied: dm-tree in /usr/local/lib/python3.10/dist-packages (from keras-nlp) (0.1.8)\n", "Requirement already satisfied: kagglehub in /usr/local/lib/python3.10/dist-packages (from keras-nlp) (0.2.5)\n", "Collecting tensorflow-text (from keras-nlp)\n", " Downloading tensorflow_text-2.16.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.2 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.2/5.2 MB\u001b[0m \u001b[31m76.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hRequirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from kagglehub->keras-nlp) (2.31.0)\n", "Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from kagglehub->keras-nlp) (4.66.4)\n", "Requirement already satisfied: namex in /usr/local/lib/python3.10/dist-packages (from keras-core->keras-nlp) (0.0.8)\n", "Requirement already satisfied: h5py in /usr/local/lib/python3.10/dist-packages (from keras-core->keras-nlp) (3.9.0)\n", "Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich->keras-nlp) (3.0.0)\n", "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich->keras-nlp) (2.16.1)\n", "Collecting tensorflow<2.17,>=2.16.1 (from tensorflow-text->keras-nlp)\n", " Downloading tensorflow-2.16.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (589.8 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m589.8/589.8 MB\u001b[0m \u001b[31m2.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hRequirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich->keras-nlp) (0.1.2)\n", "Requirement already satisfied: astunparse>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp) (1.6.3)\n", "Requirement already satisfied: flatbuffers>=23.5.26 in /usr/local/lib/python3.10/dist-packages (from tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp) (24.3.25)\n", "Requirement already satisfied: gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 in /usr/local/lib/python3.10/dist-packages (from tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp) (0.5.4)\n", "Requirement already satisfied: google-pasta>=0.1.1 in /usr/local/lib/python3.10/dist-packages (from tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp) 
(0.2.0)\n", "Collecting h5py (from keras-core->keras-nlp)\n", " Downloading h5py-3.11.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.3 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.3/5.3 MB\u001b[0m \u001b[31m112.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hRequirement already satisfied: libclang>=13.0.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp) (18.1.1)\n", "Collecting ml-dtypes~=0.3.1 (from tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp)\n", " Downloading ml_dtypes-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.2/2.2 MB\u001b[0m \u001b[31m99.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hRequirement already satisfied: opt-einsum>=2.3.2 in /usr/local/lib/python3.10/dist-packages (from tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp) (3.3.0)\n", "Requirement already satisfied: protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 in /usr/local/lib/python3.10/dist-packages (from tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp) (3.20.3)\n", "Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp) (67.7.2)\n", "Requirement already satisfied: six>=1.12.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp) (1.16.0)\n", "Requirement already satisfied: termcolor>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp) (2.4.0)\n", "Requirement already satisfied: typing-extensions>=3.6.6 in /usr/local/lib/python3.10/dist-packages (from tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp) (4.11.0)\n", "Requirement already satisfied: wrapt>=1.11.0 in /usr/local/lib/python3.10/dist-packages (from tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp) (1.14.1)\n", "Requirement already satisfied: grpcio<2.0,>=1.24.3 in /usr/local/lib/python3.10/dist-packages (from tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp) (1.64.0)\n", "Collecting tensorboard<2.17,>=2.16 (from tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp)\n", " Downloading tensorboard-2.16.2-py3-none-any.whl (5.5 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.5/5.5 MB\u001b[0m \u001b[31m109.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hCollecting keras>=3.0.0 (from tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp)\n", " Downloading keras-3.3.3-py3-none-any.whl (1.1 MB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.1/1.1 MB\u001b[0m \u001b[31m81.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hRequirement already satisfied: tensorflow-io-gcs-filesystem>=0.23.1 in /usr/local/lib/python3.10/dist-packages (from tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp) (0.37.0)\n", "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->kagglehub->keras-nlp) (3.3.2)\n", "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->kagglehub->keras-nlp) (3.7)\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->kagglehub->keras-nlp) (2.0.7)\n", "Requirement already 
satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->kagglehub->keras-nlp) (2024.2.2)\n", "Requirement already satisfied: wheel<1.0,>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from astunparse>=1.6.0->tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp) (0.43.0)\n", "Collecting optree (from keras>=3.0.0->tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp)\n", " Downloading optree-0.11.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (311 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m311.2/311.2 kB\u001b[0m \u001b[31m38.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hRequirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.17,>=2.16->tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp) (3.6)\n", "Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.17,>=2.16->tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp) (0.7.2)\n", "Requirement already satisfied: werkzeug>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from tensorboard<2.17,>=2.16->tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp) (3.0.3)\n", "Requirement already satisfied: MarkupSafe>=2.1.1 in /usr/local/lib/python3.10/dist-packages (from werkzeug>=1.0.1->tensorboard<2.17,>=2.16->tensorflow<2.17,>=2.16.1->tensorflow-text->keras-nlp) (2.1.5)\n", "Installing collected packages: optree, ml-dtypes, h5py, tensorboard, keras, tensorflow, tensorflow-text, keras-nlp\n", " Attempting uninstall: ml-dtypes\n", " Found existing installation: ml-dtypes 0.2.0\n", " Uninstalling ml-dtypes-0.2.0:\n", " Successfully uninstalled ml-dtypes-0.2.0\n", " Attempting uninstall: h5py\n", " Found existing installation: h5py 3.9.0\n", " Uninstalling h5py-3.9.0:\n", " Successfully uninstalled h5py-3.9.0\n", " Attempting uninstall: tensorboard\n", " Found existing installation: tensorboard 2.15.2\n", " Uninstalling tensorboard-2.15.2:\n", " Successfully uninstalled tensorboard-2.15.2\n", " Attempting uninstall: keras\n", " Found existing installation: keras 2.15.0\n", " Uninstalling keras-2.15.0:\n", " Successfully uninstalled keras-2.15.0\n", " Attempting uninstall: tensorflow\n", " Found existing installation: tensorflow 2.15.0\n", " Uninstalling tensorflow-2.15.0:\n", " Successfully uninstalled tensorflow-2.15.0\n", "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts.\n", "tf-keras 2.15.1 requires tensorflow<2.16,>=2.15, but you have tensorflow 2.16.1 which is incompatible.\u001b[0m\u001b[31m\n", "\u001b[0mSuccessfully installed h5py-3.11.0 keras-3.3.3 keras-nlp-0.12.1 ml-dtypes-0.3.2 optree-0.11.0 tensorboard-2.16.2 tensorflow-2.16.1 tensorflow-text-2.16.1\n", "Requirement already satisfied: keras in /usr/local/lib/python3.10/dist-packages (3.3.3)\n", "Requirement already satisfied: absl-py in /usr/local/lib/python3.10/dist-packages (from keras) (1.4.0)\n", "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from keras) (1.25.2)\n", "Requirement already satisfied: rich in /usr/local/lib/python3.10/dist-packages (from keras) (13.7.1)\n", "Requirement already satisfied: namex in /usr/local/lib/python3.10/dist-packages (from keras) (0.0.8)\n", "Requirement already satisfied: h5py in /usr/local/lib/python3.10/dist-packages (from keras) (3.11.0)\n", "Requirement already satisfied: optree in /usr/local/lib/python3.10/dist-packages (from keras) (0.11.0)\n", "Requirement already satisfied: ml-dtypes in /usr/local/lib/python3.10/dist-packages (from keras) (0.3.2)\n", "Requirement already satisfied: typing-extensions>=4.0.0 in /usr/local/lib/python3.10/dist-packages (from optree->keras) (4.11.0)\n", "Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich->keras) (3.0.0)\n", "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich->keras) (2.16.1)\n", "Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich->keras) (0.1.2)\n" ] } ], "source": [ "!pip install --upgrade keras-cv\n", "!pip install --upgrade keras-nlp\n", "!pip install --upgrade keras" ] }, { "cell_type": "markdown", "metadata": { "id": "Z3f4OHOqUQat" }, "source": [ "## Loading PaliGemma and Configuring Image Dimensions\n", "\n", "Now we'll load the PaliGemma model itself. We'll use a preset configuration to streamline the process and ensure we have a compatible model for image captioning.\n", "\n", "Today we will be using the **pali_gemma_3b_mix_448** model, which will require our images to be 448x448 pixels... but luckily we can specify this when we load our images later.\n", "\n", ">⚠️ This is crucial as PaliGemma expects images in a specific format for accurate caption generation.\n", "\n", "For future reference, the various presets primarily differ in three aspects:\n", "\n", "1. **Image Size:**\n", " - `_224`: trained and expects input images of size 224x224 pixels. This is suitable for smaller images and less computationally demanding.\n", " - `_448`: trained and expects input images of size 448x448 pixels. This offers a balance between detail and computational cost.\n", " - `_896`: trained and expects input images of size 896x896 pixels. This provides the highest level of detail, but is more computationally intensive.\n", "2. **Training Type:**\n", " - `_pt`: *pre-trained* on a large dataset of image-text pairs. It's a good starting point for general image captioning tasks.\n", " - `_mix`: *mix fine-tuned* on a diverse set of vision-language tasks. It's expected to perform well on a wider variety of tasks, but is generally intended for research purposes only.\n", "3. **Text Sequence Length:** \\\n", "This refers to the maximum length of the generated caption. 
Presets with higher image sizes usually have longer text sequence lengths as they can potentially provide more detailed descriptions.\n", "\n", "At time of writing (2024/05/28), the available presets are as follows.\n", "\n", "Preset name |\tParameters |\tDescription\n", "------------|------------|----------------\n", "pali_gemma_3b_mix_224 |\t2.92B\t | image size 224, mix fine tuned, text sequence length is 256\n", "pali_gemma_3b_mix_448\t| 2.92B\t| image size 448, mix fine tuned, text sequence length is 512\n", "pali_gemma_3b_224\t| 2.92B\t| image size 224, pre trained, text sequence length is 128\n", "pali_gemma_3b_448\t| 2.92B\t| image size 448, pre trained, text sequence length is 512\n", "pali_gemma_3b_896\t| 2.93B\t| image size 896, pre trained, text sequence length is 512\n", "\n", "You can always see an up-to-date list in the Keras docs [here](https://keras.io/api/keras_nlp/models/pali_gemma/pali_gemma_causal_lm/#frompreset-method)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "abjnGy07S_6X" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_mix_448/1/download/metadata.json...\n", "100%|██████████| 143/143 [00:00<00:00, 191kB/s]\n", "Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_mix_448/1/download/task.json...\n", "Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_mix_448/1/download/config.json...\n", "100%|██████████| 861/861 [00:00<00:00, 1.02MB/s]\n", "Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_mix_448/1/download/model.weights.h5...\n", "100%|██████████| 5.45G/5.45G [07:10<00:00, 13.6MB/s]\n", "Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_mix_448/1/download/preprocessor.json...\n", "Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_mix_448/1/download/tokenizer.json...\n", "100%|██████████| 410/410 [00:00<00:00, 494kB/s]\n", "Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_mix_448/1/download/assets/tokenizer/vocabulary.spm...\n", "100%|██████████| 4.07M/4.07M [00:01<00:00, 2.41MB/s]\n" ] } ], "source": [ "import keras_nlp\n", "\n", "# load paligemma from a preset\n", "#\n", "# for more info and options to use, see the docs:\n", "# https://keras.io/api/keras_nlp/models/pali_gemma/pali_gemma_causal_lm/#frompreset-method\n", "model_name = \"pali_gemma_3b_mix_448\"\n", "pali_gemma_lm = keras_nlp.models.PaliGemmaCausalLM.from_preset(model_name)\n", "\n", "# we need to resize the image to the size expected by the model\n", "# we're assuming the model name ends with _NUM here\n", "target_size_x = int(model_name[model_name.rfind(\"_\") + 1 :])\n", "target_size = (target_size_x, target_size_x)" ] }, { "cell_type": "markdown", "metadata": { "id": "EOERsneNVtod" }, "source": [ "## Loading and Preparing the Image\n", "\n", "Let's load our image and get it ready for PaliGemma. We'll use a sample image of a cat (my cat!) in this example.\n", "\n", "The code below will load the image from a URL, resize it to the dimensions expected by the PaliGemma model, and convert it into a Tensor object, which is the format required for model input." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pRpox89gl7Aw" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading data from https://jethac.github.io/assets/juice.jpg\n", "\u001b[1m251543/251543\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m0s\u001b[0m 0us/step\n" ] } ], "source": [ "from keras.preprocessing.image import load_img, img_to_array\n", "import tensorflow as tf\n", "\n", "# here we're loading an image of my cat because that's easier than finding a\n", "# creative commons image\n", "image_path = tf.keras.utils.get_file(\n", " \"juice.jpg\", \"https://jethac.github.io/assets/juice.jpg\"\n", ")\n", "keras_img = load_img(image_path, target_size=target_size)\n", "\n", "# convert image to NumPy array\n", "img_array = img_to_array(keras_img)\n", "\n", "# convert NumPy array to Tensor object\n", "img_tensor = tf.convert_to_tensor(img_array)" ] }, { "cell_type": "markdown", "metadata": { "id": "mEqqsC3ZNvm-" }, "source": [ "## Generating the Image Caption\n", "\n", "Finally, we'll use PaliGemma to generate a caption for our image. We'll provide the model with the image tensor and a prompt that instructs it to describe the image.\n", "\n", "Since we're not using an instruction-tuned model, we need to manually remove the prompt from the model's output to get a clean caption." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "b4AskFkBmJGG" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A black and white cat sits comfortably on a black backpack, its eyes open and its paw resting on the bag. The cat's white fur and black nose are prominent features in the image. The backpack is open, revealing the cat's black and white paws and the black strap on the side. The cat's eyes are green, and its whiskers are white. The cat's head is tilted slightly towards the camera, and its ears are perked up. The cat's black and white coat is contrasted by its white chest and paws. The cat's eyes are bright and alert, and its nose is wrinkled in concentration.\n" ] } ], "source": [ "# define prompt separately so we can measure its length later\n", "prompt = \"Caption the image:\"\n", "\n", "# pass images and prompts to paligemma\n", "response = pali_gemma_lm.generate({\"images\": [img_tensor], \"prompts\": [prompt]})\n", "\n", "# we're not using an instruction-trained model so we have to cut the prompt off\n", "# the front of our output\n", "filtered = response[0][len(prompt) :]\n", "print(filtered)" ] } ], "metadata": { "accelerator": "GPU", "colab": { "name": "[PaliGemma_1]Image_captioning.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }